Brazilian patent BR112013003850B1: Device and method for a combination write buffer with dynamically adjustable emptying measures

Abstract:
COMBINATION WRITE BUFFER WITH DYNAMICALLY ADJUSTABLE EMPTYING MEASURES. The present invention relates to a combination write buffer which, in one embodiment, is configured to maintain one or more emptying measures to determine when to transmit write operations from buffer entries. The combination write buffer can be configured to dynamically modify the emptying measures in response to activity in the write buffer, modifying the conditions under which write operations are transmitted from the write buffer to the next lower level of memory. For example, in one implementation, the emptying measures can include categorizing write buffer entries as "collapsed". A collapsed write buffer entry, and the write operations collapsed in it, can include at least one write operation that has overwritten data written by a previous write operation in the buffer entry. In another implementation, the combination write buffer can maintain a buffer occupancy limit as an emptying measure and can adjust it over time, based on buffer occupancy (...).
Publication number: BR112013003850B1
Application number: R112013003850-0
Filing date: 2011-08-11
Publication date: 2020-12-01
Inventors: Peter J. Bannon; Andrew J. Beaumont-Smith; Ramesh Gunna; Wei-Han Lien; Jaidev P. Patwardhan; Brian P. Lilly; Shih-Chieh R. Wen; Tse-Yu Yeh
Applicant: Apple Inc.
Main IPC class:

Description:

[0001] The present invention relates to the field of processors and, more particularly, to combination write buffers in caches.
Description of the Related Art
[0002] Processors often implement combination write buffers to capture write operations that have been written to a write-through cache (for example, an L1 cache) and to buffer those writes before updating a lower-level cache (for example, an L2 cache). The combination write buffer combines two or more write operations that target data in the same cache block, and thus presents fewer writes to the L2 cache.
[0003] The combination write buffer can accumulate write operations for some time. Determining when to empty write operations from one or more combination write buffer entries is a trade-off between bandwidth and performance. Buffering write operations in the combination write buffer can lead to better bandwidth efficiency. On the other hand, if the data is buffered too long, performance may suffer, because data that needs to be pushed to lower-level caches or memory remains in the combination write buffer.
SUMMARY
[0004] In one embodiment, a combination write buffer is configured to maintain one or more emptying measures to determine when to transmit write operations from buffer entries. The combination write buffer can be configured to dynamically modify the emptying measures in response to activity in the write buffer, modifying the conditions that cause write operations to be transmitted from the write buffer to the next lower level of memory. Thus, the performance/bandwidth trade-off can be dynamically adjusted based on the detected activity.
[0005] In one implementation, the emptying measures may include a categorization of write buffer entries as "collapsed". A collapsed write buffer entry, and the write operations collapsed in it, can include at least one write operation that has overwritten data written by a previous write operation in the buffer entry. Such entries can continue to accumulate write operations that overwrite previous data, and thus at least some of the data may be temporary data that is not going to be accessed again soon. For example, write operations to the write buffer entry may be part of a register spill area in memory, where processor register values are written to make the registers available for storing other data. Collapsed write buffer entries may not be considered when determining whether the write buffer occupancy has reached the threshold at which the combined write operations in one or more write buffer entries are transmitted to the next level of memory. Collapsed entries in the buffer can thus be temporarily ignored when evaluating the limit.
[0006] In another implementation, the combination write buffer can maintain the write buffer full occupancy limit as an emptying measure. The buffer can monitor buffer-full events. If a buffer-full event is detected, then the limit may be too high for the current activity level and may be reduced. On the other hand, if a number of consecutive write operations are received in the buffer without detecting a buffer-full event, then the limit may be too low and may be increased. Therefore, the limit can be adjusted based on the actual buffer occupancy that is detected over time.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The following detailed description makes reference to the accompanying drawings, which are now briefly described.
[0008] Figure 1 is a block diagram of one embodiment of a processor core, caches and a combination write buffer.
[0009] Figure 2 is a flowchart that illustrates the operation of one embodiment of a combination write buffer at a high level.
[0010] Figure 3 is a block diagram of one embodiment of a combination write buffer.
[0011] Figure 4 is a flowchart that illustrates the operation of the combination write buffer embodiment shown in figure 3 in response to receiving a write operation.
[0012] Figure 5 is a flowchart that illustrates the operation of the combination write buffer embodiment shown in figure 3 when evaluating emptying measures.
[0013] Figure 6 is a block diagram of another embodiment of the combination write buffer.
[0014] Figure 7 is a flowchart that illustrates the operation of the combination write buffer embodiment shown in figure 6 in response to receiving a write operation.
[0015] Figure 8 is a block diagram of one embodiment of a system.
[0016] Although the invention is susceptible to various modifications and alternative forms, specific embodiments of it are shown by way of example in the drawings and will be described here in detail. It should be understood, however, that the drawings and their detailed description are not intended to limit the invention to the particular form disclosed; on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention, as defined by the appended claims. The headings used here are for organizational purposes only and are not meant to limit the scope of the description. As used throughout this application, the word "can" is used in a permissive sense (that is, meaning having the potential to), rather than in the mandatory sense (that is, meaning must). Similarly, the words "include", "including" and "includes" mean including, but not limited to.
[0017] Various units, circuits or other components can be described as "configured to" perform a task or tasks. In these contexts, "configured to" is a broad recitation of structure that generally means "having a circuit that" performs the task or tasks during operation. As such, the unit/circuit/component can be configured to perform the task even when the unit/circuit/component is not currently powered on. In general, the circuit that forms the structure corresponding to "configured to" may include hardware circuits. Similarly, various units/circuits/components can be described as performing a task or tasks, for convenience of description. These descriptions should be interpreted as including the phrase "configured to". Reciting a unit/circuit/component that is configured to perform one or more tasks is expressly intended not to invoke the interpretation of 35 U.S.C. § 112, paragraph six, for that unit/circuit/component.
DETAILED DESCRIPTION OF EMBODIMENTS
[0018] Turning now to figure 1, a block diagram of one embodiment of a processor core 10, a level 1 (L1) data cache 12, a level 2 (L2) interface unit 14 and an L2 cache 16 is shown. The L2 interface unit 14 can include a fill buffer 18 and a combination write buffer (CWB) 20. The CWB 20 can be configured to maintain one or more emptying measures 22. The processor core 10 is coupled to the L1 data cache 12, which is coupled to the L2 interface unit 14. The L2 interface unit 14 is additionally coupled to the L2 cache 16, which can in turn be coupled to the next level of memory in the memory hierarchy (not shown in figure 1).
[0019] Processor core 10 can implement any instruction set architecture, and can include circuitry for executing the instructions defined in that instruction set architecture. In various embodiments, the processor core 10 can implement any microarchitecture, including superscalar or scalar, superpipelined or pipelined, out of order or in order, speculative or non-speculative, etc. Various embodiments may or may not employ microcoding techniques, as desired.
[0020] The instruction set architecture implemented by processor core 10 can specify explicit load instructions, defined to transfer data from memory to the processor (for example, to a register in the processor), and explicit store instructions, defined to transfer data from the processor to memory. Either transfer can be completed in a cache, in various embodiments. Alternatively or additionally, the instruction set architecture can specify implicit loads and stores (for example, for an instruction that performs a non-load/store operation on a memory operand). Accordingly, processor core 10 can be said to execute or perform a load operation or a store operation. The load/store operation can be derived from the explicit instruction or from the implicit load/store.
[0021] The processor core 10 can be configured to generate a read operation in response to a load operation, and to generate a write operation in response to a store operation. Read/write operations can be propagated to a memory hierarchy that includes one or more levels of cache and a main memory subsystem. Caches can cache data that is also stored in the main memory subsystem, and the data in the memory hierarchy is identified by a memory address defined in a memory address space corresponding to the main memory subsystem. For example, in the embodiment of figure 1, the L1 and L2 caches can be levels of memory in the memory hierarchy. There may be additional levels, including the main memory level and, optionally, one or more additional levels of cache. Other embodiments may not include the L2 cache 16, and the next level of memory below the L1 cache may be the main memory subsystem. Generally, a read/write operation can enter the memory hierarchy at the top (the level closest to the processor core 10) and can be passed from one level to the next until the operation is completed. The main memory subsystem can be the lowest level in the memory hierarchy. Data can be moved to and from the main memory subsystem by various peripheral devices, such as mass storage devices (for example, disk drives) or network devices, but the data is not identified by the memory address on these devices (for example, mass storage devices may have their own address space for locating data on the device, or the network to which a network device is connected may have its own address space identifying devices on the network).
[0022] A read operation can be completed when the data for the read is returned from the memory hierarchy (for example, from any level of cache or from the main memory subsystem), and a write operation can be completed by the processor core 10 sending the write data. The write operation can include the address, an indication of the size of the write (for example, in bytes), and the write data. The write operation can also include other write attributes (for example, cacheability, coherence, etc.).
[0023] Data cache 12 can implement any capacity and configuration (for example, direct mapped, set associative, etc.). Data cache 12 can be configured to allocate and deallocate cache storage in units of cache blocks. A cache block can be any size (for example, 32 bytes, 64 bytes, 128 bytes, etc.) and can be aligned in memory on a natural address boundary for the block size (for example, a 32-byte cache block can be aligned to a 32-byte boundary, a 64-byte cache block can be aligned to a 64-byte boundary, etc.).
[0024] In the illustrated embodiment, data cache 12 is write-through (WT). In a write-through cache, write operations that hit in the cache are propagated to the next level of memory, in addition to updating the cache block in the cache. Write operations that miss in the cache are also propagated to the next level of memory. On the other hand, a write-back cache (or store-in cache) can update the stored cache block and may not propagate the write operation. Instead, the updated cache block may eventually be written back from the write-back cache to the next level of memory when it is evicted from the cache.
[0025] The L2 interface unit 14 can receive write operations from the L1 data cache 12 and can also receive read cache misses (as fill requests). The L2 interface unit 14 can be configured to store write operations in the CWB 20, and to store fill requests in the fill buffer 18. Fill requests can be transmitted to the L2 cache 16 (and to lower levels of the memory hierarchy, as needed), and the fill data can be returned to the L1 data cache 12 and written into it.
[0026] The CWB 20 can buffer the write operations and transmit them to the L2 cache 16 at various points in time. The write operations can include write-through writes that hit in data cache 12 and updated the cache block there. The write operations can also include write-through writes that missed in data cache 12. The CWB 20 can include multiple buffer entries. Each buffer entry can be configured to store write operations at cache block granularity. That is, the entry can be allocated to a cache-block-sized region aligned to a cache block boundary in memory. Any write within the cache block can be stored in the allocated entry. An initial write operation to the cache block can cause the CWB 20 to allocate the entry, and the write data can be stored in the buffer along with the address and an indication of which bytes in the cache block are updated (for example, a byte mask). Subsequent write operations can be merged into the entry, writing the data into the appropriate bytes of the cache block and updating the byte mask.
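The merge behavior described in the paragraph above can be illustrated with a minimal Python sketch. It is only a software model of what would be a hardware structure; the 64-byte block size and the names `Entry`, `write_mask` and `merge_write` are assumptions made for illustration and do not appear in the patent.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # assumed cache block size in bytes (the text allows any size)


@dataclass
class Entry:
    """One write buffer entry covering a single aligned cache block."""
    block_addr: int  # address of the aligned cache block
    data: bytearray = field(default_factory=lambda: bytearray(BLOCK_SIZE))
    byte_mask: int = 0  # bit i set means byte i of `data` holds written data


def write_mask(offset: int, size: int) -> int:
    """Return a mask with `size` bits set, starting at bit `offset`."""
    return ((1 << size) - 1) << offset


def merge_write(entry: Entry, addr: int, payload: bytes) -> None:
    """Merge a write into an entry that already tracks the write's cache block."""
    offset = addr - entry.block_addr
    if offset < 0 or offset + len(payload) > BLOCK_SIZE:
        raise ValueError("write does not fall inside this entry's cache block")
    entry.data[offset:offset + len(payload)] = payload
    entry.byte_mask |= write_mask(offset, len(payload))


entry = Entry(block_addr=0x1000)
merge_write(entry, 0x1000, b"\xAA\xBB")      # initial write covers bytes 0-1
merge_write(entry, 0x1004, b"\x01\x02\x03")  # later write covers bytes 4-6
print(f"{entry.byte_mask:#x}")  # prints 0x73
```

Keeping the mask as a single integer makes the later overlap checks simple bitwise operations, which mirrors how a per-byte valid bit would be used.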
[0027] The CWB 20 can be configured to accumulate one or more emptying measures 22 to determine when to transmit one or more combined write operations from the buffer entries to the L2 cache 16. The CWB 20 can be configured to monitor activity in the write buffer to determine the emptying measures 22. Thus, emptying measures can generally be dynamically generated data that the CWB 20 can use to determine when to transmit the combined write operations (that is, empty the write buffer entry or entries) to the next level of memory. Because emptying measures are generated dynamically, the frequency at which combined write operations are emptied can vary over time, based on the detected write buffer activity. That is, emptying measures can be used together with the occupancy of the write buffer (that is, the number of buffer entries that are occupied, compared to the total number of buffer entries) to determine when to transmit one or more combined write operations to the next level of memory.
[0028] For example, in one embodiment, the emptying measures may include the detection of collapsed write buffer entries. A collapsed write buffer entry can be an entry in which at least one write operation has been merged into the entry, and that write operation overwrote at least one byte of write data that was written to the entry by a previous write operation. For example, if a byte mask is maintained to indicate which bytes in the cache block are updated, a collapsed write can be detected if a write operation is merged into the entry and at least one bit of the byte mask that would be set because of the merging write operation is already set. Other embodiments can detect the collapse at other levels of granularity within the cache block (for example, word, double word, etc.). Still other embodiments can detect a collapsed write only if all the bytes updated by the write have their corresponding mask bits set before the collapsed write. That is, a collapsed write can be detected if the byte mask has the same value before and after the collapsed write is merged.
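As a rough illustration of the two byte-mask detection variants described above, the check reduces to a bitwise AND of the entry's existing mask with the mask of the incoming write. The function name `is_collapsing` and its `require_full_overlap` flag are hypothetical names chosen for this sketch.

```python
def is_collapsing(existing_mask: int, incoming_mask: int,
                  require_full_overlap: bool = False) -> bool:
    """Report whether merging `incoming_mask` would be a collapsed write.

    existing_mask: byte mask of the entry before the merge.
    incoming_mask: bits for the bytes the new write updates.
    require_full_overlap: if True, use the stricter variant in which every
        byte of the new write must already be valid, so the entry's mask is
        unchanged by the merge.
    """
    overlap = existing_mask & incoming_mask
    if require_full_overlap:
        return incoming_mask != 0 and overlap == incoming_mask
    return overlap != 0


# Entry already holds bytes 0-1; the new write touches bytes 1-2.
print(is_collapsing(0b0011, 0b0110))                             # True
print(is_collapsing(0b0011, 0b0110, require_full_overlap=True))  # False
```

The stricter variant is equivalent to checking that `existing_mask | incoming_mask == existing_mask`, which matches the "same value before and after" wording in the text.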
[0029] CWB 20 can be configured to remove collapsed write buffer entries from consideration when determining write buffer occupancy. For example, CWB 20 can be configured to transmit combined write operations from one or more write buffer entries as CWB 20 approaches full occupancy (for example, when a full occupancy threshold level is met). Since collapsed write buffer entries are not considered when determining write buffer occupancy for the purpose of emptying entries, the write buffer may tend to become fuller (in terms of occupied entries) when collapsed write buffer entries are detected in the buffer than when they are not. In one embodiment, a fixed or programmable threshold value can be used to determine that CWB 20 is approaching full occupancy. The count of entries that are in use, minus any entries that are in the collapsed state, can be compared to the threshold value. When the threshold value is reached (for example, met or exceeded), the CWB 20 can empty one or more write buffer entries. In one embodiment, the CWB 20 can empty one write buffer entry in response to reaching the threshold value, and can continue to empty write buffer entries until the number of occupied write buffer entries falls below the threshold value. In other embodiments, more than one write buffer entry can be emptied concurrently. Since collapsed write buffer entries are not counted towards the full occupancy limit, write buffer entries can be emptied less frequently than when there are no collapsed write buffer entries.
[0030] Emptying a write buffer entry may involve one or more combined write operations. For example, a write operation can be generated for each set of contiguous updated bytes in the entry. If there are gaps of non-updated bytes in the entry, multiple write operations can be transmitted. In other embodiments, one combined write operation per entry can be transmitted with a byte mask or other indication identifying which bytes must be updated in the cache block. In still other embodiments, writes of a given size (for example, a word) can be generated.
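The first option above, one write operation per set of contiguous updated bytes, amounts to scanning the byte mask for runs of set bits. The following sketch assumes a 64-byte block by default; `contiguous_runs` is a hypothetical helper name.

```python
from typing import List, Tuple


def contiguous_runs(byte_mask: int, block_size: int = 64) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs, one per run of set bits in `byte_mask`.

    Each pair would correspond to one write operation sent to the next level
    of memory when the entry is emptied.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # offset where the current run began, or None if not in a run
    for i in range(block_size):
        if (byte_mask >> i) & 1:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # run extends to the end of the block
        runs.append((start, block_size - start))
    return runs


# Bytes 0-1 and 4-6 are valid, so two separate writes would be generated.
print(contiguous_runs(0x73))  # prints [(0, 2), (4, 3)]
```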
[0031] In another embodiment, the emptying measures 22 may include the threshold value itself. The threshold value can be modified dynamically based on the detection of write buffer full events. A write buffer full event can indicate that the write buffer is actually full (all buffer entries allocated to write operations). As such, CWB 20 can empty write buffer entries based on the threshold value, and can detect buffer-full events to indicate that the threshold value is to be modified. For example, if a buffer-full event is detected, the CWB 20 may determine that the threshold value is too high (for example, too close to a full buffer), causing the write buffer to fill before the emptying of an entry can be completed. A full buffer can impact the performance of the processor core. Therefore, the threshold value can be reduced in response to the buffer-full event. On the other hand, if a number of write operations are written to the buffer without detecting a buffer-full event, the threshold value may be too low (for example, too far from a full buffer) and may be increased.
[0032] Therefore, the limit can be adapted over time, based on whether or not the buffer is becoming full. If traffic is causing the buffer to fill more quickly, the limit may be reduced. In this way, writes may not back up in the buffer and cause the processor core 10 to stall. If traffic is causing the buffer to fill less quickly, the limit can be increased. The buffer may then be allowed to hold write operations for longer, reducing traffic (and power consumption) at the L2 cache 16. That is, the frequency at which writes are transmitted from a write buffer entry to the L2 cache 16 can increase and decrease with changes in the limit.
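The adaptive limit described above can be modeled in a few lines of Python. This is a sketch under stated assumptions: the text does not specify the adjustment step, the lower bound, or whether the write counter is cleared after an increase, so the `step` and `minimum` parameters and the counter reset are illustrative choices, and `AdaptiveFullLimit` is a hypothetical name.

```python
class AdaptiveFullLimit:
    """Tracks a full occupancy limit that adapts to buffer-full events."""

    def __init__(self, num_entries: int, full_limit: int, write_limit: int,
                 step: int = 1, minimum: int = 1) -> None:
        self.num_entries = num_entries  # total entries in the write buffer
        self.full_limit = full_limit    # current limit that triggers emptying
        self.write_limit = write_limit  # quiet writes needed before raising it
        self.step = step                # assumed adjustment size
        self.minimum = minimum          # assumed lower bound for the limit
        self.write_count = 0            # writes since the last buffer-full event

    def on_write(self, buffer_full: bool) -> int:
        """Update the limit for one received write and return the new limit."""
        if buffer_full:
            # The buffer filled before emptying could keep up: lower the limit.
            self.full_limit = max(self.minimum, self.full_limit - self.step)
            self.write_count = 0
        else:
            self.write_count += 1
            if self.write_count >= self.write_limit:
                # A long run of writes without a full buffer: raise the limit.
                self.full_limit = min(self.num_entries, self.full_limit + self.step)
                self.write_count = 0
        return self.full_limit


limit = AdaptiveFullLimit(num_entries=8, full_limit=6, write_limit=4)
after_full = limit.on_write(buffer_full=True)
for _ in range(4):
    after_quiet = limit.on_write(buffer_full=False)
print(after_full, after_quiet)  # prints 5 6
```

Clamping the limit between a minimum and the total entry count keeps it meaningful under sustained bursts or long quiet periods; a real design might choose different bounds.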
[0033] In other embodiments, other emptying measures can be accumulated (for example, how often a full cache block of writes is buffered, snoop hits in the buffer, etc.). The frequency of emptying write buffer entries to the L2 cache 16 can be modified based on these measures in the same way. For example, the emptying frequency can be increased if snoop hits are being detected (indicating that other processor cores are using the data being written).
[0034] In some embodiments, the write operations stored in CWB 20 may also include non-cacheable write operations. Some non-cacheable write operations can be write-combinable, and can be merged into a buffer entry in the same way as the write-through writes discussed above. Other non-cacheable write operations may not be write-combinable (or write combining of non-cacheable writes may not be supported). In such cases, each non-cacheable write operation can be allocated its own separate entry in CWB 20. Other embodiments can store non-cacheable write operations in a different write buffer.
[0035] The L2 cache 16 can be of any size and construction, similar to the above discussion of data cache 12. The L2 cache 16 can be write-back or write-through, in various embodiments. The L2 cache 16 can also include an interface to the next level of memory, which can be the main memory subsystem or a third level (L3) cache in various embodiments.
[0036] It is noted that a combination write buffer (CWB) 20 can be included between other levels of the memory hierarchy in the same way. For example, a CWB 20 can be included below any cache level that is write-through. It is also noted that, in one embodiment, the processor core 10, the L1 data cache 12 and the L2 interface unit 14 (including the fill buffer 18 and the CWB 20) can be integrated together as a processor. In another embodiment, the L2 interface unit 14 can be shared with another processor core 10/L1 data cache 12. In still other embodiments, the L2 cache 16 can be integrated into the processor, and/or other components can be integrated (for example, in a system-on-a-chip configuration).
[0037] Turning next to figure 2, a flowchart illustrating the operation of one embodiment of the CWB 20 is shown. Although the blocks are shown in a particular order for ease of understanding, other orders can be used. Blocks can be performed in parallel in combinatorial logic in CWB 20. Blocks, combinations of blocks and/or the flowchart as a whole can be pipelined over multiple clock cycles. The CWB 20 can be configured to implement the operation shown in figure 2.
[0038] CWB 20 can be configured to monitor activity in the write buffer (block 30). For example, buffer occupancy, the number of write operations merged into the buffer, collapsed writes, etc. can be monitored by CWB 20. If the detected activity indicates a change in an emptying measure maintained by CWB 20 (decision block 32, "yes" branch), CWB 20 can modify the emptying measure (block 34). If a combination of the buffer state and the emptying measure(s) 22 indicates that one or more buffer entries should be emptied (decision block 36, "yes" branch), the CWB 20 can be configured to transmit one or more combined write operations from one or more buffer entries to the L2 cache 16 (block 38). More generally, the write operations can be passed on to the next level of memory in the memory hierarchy.
[0039] Turning now to figure 3, a block diagram of one embodiment of the CWB 20 is shown. In the embodiment of figure 3, CWB 20 includes a control circuit 40 coupled to a write buffer 42. Buffer 42 is coupled to receive write operations from the L1 data cache 12, and to provide combined write operations to the L2 cache 16 (or, more generally, to the next level of memory in the memory hierarchy). Control circuit 40 includes a set of registers 44A to 44C, which can store a collapse age limit (CAge), an age limit, and a full occupancy limit, respectively. Registers 44A to 44C can be implemented as one register or multiple registers in general, and can be software addressable for programmability in some embodiments. In other embodiments, one or more of the limits can be fixed.
[0040] Example entries 46A to 46B in buffer 42 are shown in figure 3, and each entry includes an address field (A), a data field (D), a byte mask field (Byte Mask), an age counter field (Age Control), and a collapsed state field (Collapsed). Entries in addition to the illustrated entries can be included in buffer 42. Taken together, the collapsed states across all entries can represent an emptying measure 22. The address field can store the address of the cache block represented by the entry, and the data field can include storage for the cache block of data, although the entire cache block may not be valid. That is, the entry can store a partial cache block of valid data at any given time. The byte mask field can include a bit for each byte in the cache block. The bit can indicate whether or not the corresponding byte is valid in the data field (that is, whether or not the byte was written by a write operation represented in the entry). In one embodiment, the mask bit can be set to indicate that the byte is valid and cleared to indicate that it is invalid, although other embodiments can use the opposite meanings for the set and clear states. The age counter can indicate the age of the entry. The age counter can initially be set to zero, and can be incremented for each clock cycle in which the write is in buffer 42 or for each write operation that is presented to buffer 42. In other embodiments, the age counter can be initialized/reset to a defined value and decremented. The collapsed state can indicate whether or not the entry is collapsed, that is, whether or not at least one collapsed write has been detected for the entry. The collapsed state can be, for example, a bit that indicates, when set, that the entry is collapsed and, when cleared, that the entry is not collapsed (or vice versa). Other embodiments may use other state indications.
[0041] Turning now to figure 4, a flowchart is shown illustrating the operation of the CWB 20 embodiment illustrated in figure 3 in response to receiving a write operation from data cache 12. Although the blocks are shown in a particular order for ease of understanding, other orders can be used. Blocks can be performed in parallel in combinatorial logic in CWB 20. Blocks, combinations of blocks and/or the flowchart as a whole can be pipelined over multiple clock cycles. The CWB 20 and, in particular, the control circuit 40 can be configured to implement the operation shown in figure 4.
[0042] CWB 20 can be configured to compare the address of the write operation with the addresses in the write buffer 42 (at cache line granularity). For example, the address field of the entries in the write buffer 42 can be implemented as a content addressable memory (CAM). If the write operation is a hit on a buffer entry (that is, the write operation falls within the cache block represented by the entry; decision block 50, "yes" branch), control circuit 40 can be configured to reset the age counter in the entry (block 52). Thus, in this embodiment, the age counter can be the age of the entry since the most recent write operation was merged into the entry. If the write operation overwrites at least one byte that was already written to the entry by a previous write operation (decision block 54, "yes" branch), the control circuit 40 can be configured to set the collapsed state to indicate collapsed (block 56). The control circuit 40 can be configured to update the byte mask and cause the data to be written to the data field of the hit entry.
[0043] If the write operation is a miss in buffer 42 (decision block 50, "no" branch), the control circuit 40 can be configured to allocate a new (currently unoccupied) entry for the write operation (block 60). The control circuit 40 can initialize the allocated entry with information corresponding to the write operation (block 62). In particular, control circuit 40 can cause the allocated entry to be updated with the address and data of the write operation, can set the byte mask to indicate the bytes updated by the write, can clear the age counter, and can clear the collapsed state. If the write buffer is full (that is, there is no currently unoccupied entry), control circuit 40 can apply back pressure to the L1 data cache 12/processor core 10 to stall the write operation until an entry is available.
[0044] Decision block 54 and block 56 (setting the collapsed state) can be the equivalent of decision block 32 and block 34, respectively, for the CWB 20 embodiment shown in figure 3.
[0045] Turning now to figure 5, a flowchart is shown illustrating the operation of the CWB 20 embodiment illustrated in figure 3 when evaluating buffer entries and determining emptying events. Although the blocks are shown in a particular order for ease of understanding, other orders can be used. Blocks can be performed in parallel in combinatorial logic in CWB 20. Blocks, combinations of blocks and/or the flowchart as a whole can be pipelined over multiple clock cycles. The CWB 20 and, in particular, the control circuit 40 can be configured to implement the operation shown in figure 5.
[0046] Control circuit 40 can be configured to determine a full occupancy count as the number of occupied entries minus the number of collapsed entries (block 70). That is, the full occupancy count can be the number of non-collapsed entries. If the full occupancy count has reached the full occupancy limit 44C (decision block 72, "yes" branch), the control circuit 40 can be configured to transmit the combined write operations from one or more entries to the L2 cache 16, or to the next level of the memory hierarchy (block 74). Control circuit 40 can be configured to select any entry for transmitting combined write operations to the L2 cache 16. For example, in one embodiment, control circuit 40 can select the oldest entry (as indicated by the age counter) that is not a collapsed entry. In another embodiment, both collapsed and non-collapsed entries can be considered for age-based selection. In another embodiment, entries can be emptied in first-in, first-out (FIFO) order of their allocation. The combination of blocks 70 and 72 can be the equivalent of block 36, and block 74 can be the equivalent of block 38, in this embodiment. In another embodiment, an additional limit can be set (higher than the full occupancy limit) to be compared against the total number of occupied entries (collapsed and non-collapsed). If the total number of occupied entries reaches the additional limit, the control circuit 40 can be configured to transmit combined write operations from one or more buffer entries to the L2 cache 16 (block 74).
[0047] The rest of the flowchart illustrated in figure 5 can be applied to each buffer entry in buffer 42 (for example, in parallel for each buffer entry). If the age counter has reached the collapsed age limit 44A (decision block 76, “yes” branch), control circuit 40 can be configured to reset the collapsed state of the entry, indicating not collapsed (block 78). Thus, since the age counter is reset on each write to the entry in this embodiment, a collapsed write will no longer be considered collapsed after a number of clock cycles equal to the collapsed age limit has elapsed without another write to the entry. If the age counter has reached the age limit 44B (decision block 80, “yes” branch), control circuit 40 can be configured to empty the entry (block 82), transmitting the one or more combined write operations of the entry. Blocks 80 and 82 can be another equivalent of blocks 36 and 38, respectively, in this embodiment.
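The per-entry checks of blocks 76 to 82 can likewise be sketched in Python, again only as an illustrative model. The names `Entry`, `tick` and `on_write_hit` are hypothetical, and the sketch assumes that the age counter advances by one per clock cycle, which the description does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical software model of one occupied write buffer entry."""
    occupied: bool = True
    collapsed: bool = False
    age: int = 0


def tick(entry: Entry, collapsed_age_limit: int, age_limit: int) -> bool:
    """Advance one clock cycle; return True if the entry should be emptied."""
    if not entry.occupied:
        return False
    entry.age += 1  # assumption: one increment per clock cycle
    # Blocks 76 and 78: a collapsed entry that has not been written for
    # collapsed_age_limit cycles is no longer treated as collapsed.
    if entry.collapsed and entry.age >= collapsed_age_limit:
        entry.collapsed = False
    # Blocks 80 and 82: empty the entry once it reaches the age limit.
    return entry.age >= age_limit


def on_write_hit(entry: Entry, overwrote_bytes: bool) -> None:
    """A write to the entry resets its age; overwriting data marks it collapsed."""
    entry.age = 0
    if overwrote_bytes:
        entry.collapsed = True
```

With a collapsed age limit of 3 and an age limit of 5, an idle collapsed entry in this model loses its collapsed state on the third cycle and is emptied on the fifth.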
[0048] Turning now to figure 6, a block diagram of another embodiment of CWB 20 is shown. In the embodiment of figure 6, CWB 20 includes a control circuit 90 coupled to a write buffer 92. Buffer 92 is coupled to receive write operations from the L1 data cache 12, and to provide combined write operations to the L2 cache 16. Control circuit 90 includes a set of registers 94A to 94C, which can store a write limit, a write count and a full occupancy limit, respectively. Registers 94A to 94C can be implemented as one register or as more than one register in general, and can be software addressable for programmability in some embodiments. In other embodiments, one or more of the limits can be fixed.
[0049] In the embodiment of figure 6, the combination of write count 94B and full occupancy limit 94C can be the emptying measures 22. Write count 94B can be a count of the write operations that have been stored in buffer 92 since the most recent buffer-full event. The full occupancy limit 94C can be the number of buffer entries that are occupied before an emptying is performed, in this embodiment. The full occupancy limit 94C can be varied based on write buffer activity, as discussed below.
[0050] Example entries 96A and 96B are shown in figure 6. Additional entries similar to the illustrated entries can be included. The embodiment of figure 6 includes an address field (A), a data field (D), a byte mask field (Byte Mask) and an age counter field (Age Counter) similar to the same fields described above with respect to figure 3.
[0051] Figure 7 is a flowchart illustrating the operation of the embodiment of CWB 20 illustrated in figure 6 in response to receiving a write operation. Although the blocks are shown in a particular order for ease of understanding, other orders can be used. The blocks can be performed in parallel in combinatorial logic in CWB 20. Blocks, combinations of blocks and/or the flowchart as a whole can be pipelined over multiple clock cycles. CWB 20 and, in particular, control circuit 90 can be configured to implement the operation shown in figure 7. In addition to the operation shown in figure 7, the embodiment of figure 6 can detect a hit or a miss in buffer 92 and can update the entries accordingly, as shown in blocks 50, 52, 58, 60 and 62 in figure 4, and can implement blocks 72 and 74 of figure 5 in the same way. Optionally, the embodiment of figure 6 can also implement blocks 80 and 82 of figure 5 in some embodiments.
[0052] Control circuit 90 can be configured to determine whether the received write operation fills buffer 92 (decision block 100). For example, if the received write operation is a miss in buffer 92 and the last unoccupied entry is allocated to the received write operation, buffer 92 is fully occupied. If so (decision block 100, “yes” branch), control circuit 90 can be configured to reduce the full occupancy limit 94C (block 102) and clear write count 94B (block 104). Otherwise, if the received write operation does not cause a full occupancy event (decision block 100, “no” branch), control circuit 90 can be configured to increase write count 94B (block 106). If the write count has reached the write limit (decision block 108, “yes” branch), control circuit 90 can be configured to increase the full occupancy limit. In this embodiment, the flowchart of figure 7 can be equivalent to blocks 32 and 34 in figure 3.
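As a hedged illustration of the flow of figure 7, the adjustment of the full occupancy limit can be modeled in Python as follows. The class name `AdaptiveLimit`, the step size of one entry, the clamping of the limit between 1 and the number of entries, and the clearing of the write count after the limit is raised are all assumptions made for the sketch; the description does not specify them.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveLimit:
    """Hypothetical model of registers 94A to 94C in control circuit 90."""
    write_limit: int      # register 94A
    full_limit: int       # register 94C
    num_entries: int      # assumed upper bound for the full occupancy limit
    write_count: int = 0  # register 94B

    def on_write(self, fills_buffer: bool) -> None:
        if fills_buffer:
            # Blocks 102 and 104: the buffer filled up, so empty earlier
            # next time and restart the count of writes.
            self.full_limit = max(1, self.full_limit - 1)  # assumed step and floor
            self.write_count = 0
            return
        # Block 106: one more write without a buffer-full event.
        self.write_count += 1
        # Block 108: enough writes without filling up, so allow more occupancy.
        if self.write_count >= self.write_limit:
            self.full_limit = min(self.num_entries, self.full_limit + 1)
            self.write_count = 0  # assumption: not stated in the description
```

For example, with a write limit of 3 and a full occupancy limit of 4 in an 8-entry buffer, one buffer-full event lowers the limit to 3, and three subsequent writes that do not fill the buffer raise it back to 4.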
[0053] Therefore, the full occupancy limit can be modified dynamically in this embodiment, so that the emptying of an entry (as illustrated in blocks 72 and 74, where the full occupancy count is the number of occupied entries in this embodiment) prevents full occupancy events from occurring, while allowing buffer 92 to be as fully occupied as possible, based on the traffic detected in CWB 20. The write limit can be determined in any desired way. For example, if a given percentage of write operations is expected to be merged into write buffer entries, the write limit can be equal to the product of the number of write operations per cache block (for example, the number of words in the cache block), the number of write buffer entries and the merge percentage.
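The product described above can be worked through with a short Python sketch. The function name `write_limit_estimate` and the example figures (8 words per cache block, 8 buffer entries, 50% merging) are hypothetical, and the sketch reads the product as a candidate value for the write limit held in register 94A, which is an interpretation rather than something the description confirms.

```python
def write_limit_estimate(writes_per_block: int,
                         num_entries: int,
                         merge_fraction: float) -> int:
    """Product of writes per cache block, buffer entries and the
    expected fraction of writes that merge (all values assumed)."""
    return int(writes_per_block * num_entries * merge_fraction)


# Hypothetical figures: 8 words per block, 8 entries, 50% merging.
example = write_limit_estimate(8, 8, 0.5)
```

With these assumed figures the product is 8 × 8 × 0.5 = 32 write operations.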
[0054] Turning next to figure 8, a block diagram of an embodiment of a system 350 is shown. In the illustrated embodiment, system 350 includes at least one instance of an integrated circuit 358 coupled to an external memory 352. External memory 352 can form the main memory subsystem discussed above with respect to figure 1. Integrated circuit 358 can include at least the processor core 10 and the L1 data cache 12 shown in figure 1, and can include one or more of the L2 interface unit 14 and the L2 cache 16. Integrated circuit 358 can further include other components, as desired. Integrated circuit 358 is coupled to one or more peripherals 354 and to external memory 352. A power supply 356 is also provided, which supplies the supply voltages to integrated circuit 358, as well as one or more supply voltages to memory 352 and/or peripherals 354. In some embodiments, more than one instance of integrated circuit 358 can be included (and more than one external memory 352 can be included as well).
[0055] Memory 352 can be any type of memory, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR, DDR2, DDR3, etc.) (including mobile versions of SDRAMs, such as mDDR3, etc., and/or low power versions of SDRAMs, such as LPDDR2, etc.), RAMBUS DRAM (RDRAM), static RAM (SRAM), etc. One or more memory devices can be coupled on a circuit board to form memory modules, such as single in-line memory modules (SIMMs), dual in-line memory modules (DIMMs), etc. Alternatively, the devices can be mounted with integrated circuit 358 in a chip-on-chip configuration, a package-on-package configuration or a multi-chip module configuration.
[0056] Peripherals 354 can include any desired circuitry, depending on the type of system 350. For example, in one embodiment, system 350 can be a mobile device (for example, a personal digital assistant (PDA), a smartphone, etc.) and peripherals 354 can include devices for various types of wireless communication, such as Wi-Fi, Bluetooth, cellular, global positioning system, etc. Peripherals 354 can also include additional storage, including RAM storage, solid state storage, or disk storage. Peripherals 354 can include user interface devices, such as a display screen, including touch display screens or multi-touch display screens, a keyboard or other input devices, microphones, speakers, etc. In other embodiments, system 350 can be any type of computing system (for example, a desktop personal computer, a laptop, a workstation, a nettop, etc.).
[0057] Numerous variations and modifications will become evident to those skilled in the art, once the above disclosure is fully appreciated. The embodiments are intended to be interpreted as encompassing all such variations and modifications.

Claims (7):
[0001]
Apparatus, comprising: a cache (12); a write buffer (42) coupled to the cache (12) and configured to buffer write operations that access the cache (12), wherein the write buffer (42) comprises a plurality of entries (46A, 46B), each entry (46A, 46B) configured to combine write operations at the granularity of a cache (12) block; and a control circuit (40) coupled to the write buffer (42), wherein the control circuit (40) is configured to cause the write buffer (42) to transmit one or more combined write operations from one or more entries (46A, 46B) of the plurality of entries (46A, 46B) to a next level of memory (16) below the cache (12), in response to one or more emptying measures (22) applied by the control circuit (40) and in response to reaching a full occupancy limit level of the write buffer (42), characterized by the fact that: the control circuit (40) is configured to dynamically modify the one or more emptying measures (22) in response to activity in the write buffer (42), wherein the dynamic modification of the one or more emptying measures (22) changes a frequency at which the control circuit (40) causes the transmission of combined write operations from the write buffer (42) to the next level of memory (16), and wherein the one or more emptying measures (22) comprise a collapse state in each entry (46A, 46B) of the plurality of entries (46A, 46B), wherein the collapse state indicates whether or not at least one collapsed write has been detected in the combined write operations in that entry (46A, 46B), and wherein a collapsed write is a write operation that overwrites data written by a previous write operation in that entry (46A, 46B), and wherein the control circuit (40) is configured to cause the transmission of the one or more combined write operations in response to the number of occupied buffer entries (46A, 46B) that are not in a collapsed state reaching the full occupancy limit level.
[0002]
Apparatus according to claim 1, characterized by the fact that the control circuit (40) is further configured to detect the collapsed write to a first entry (46A), and to modify the collapse state of the first entry (46A) to indicate collapsed.
[0003]
Apparatus according to claim 2, characterized by the fact that it further comprises an age counter corresponding to each entry (46A, 46B) of the plurality of entries (46A, 46B), wherein the control circuit (40) is configured to modify the collapse state in a second entry (46B) of the plurality of entries (46A, 46B) to indicate not collapsed in response to the age counter reaching a second limit, and wherein the control circuit (40) is configured to reset the age counter in response to a write operation hitting the second entry (46B).
[0004]
Apparatus according to any one of claims 1 to 3, characterized by the fact that it further comprises: a processor core (10) configured to execute store operations and to generate write operations in response to the store operations; and a second level cache (16) coupled to receive the combined write operations from the write buffer (42) and configured to update the cache blocks stored therein with the combined write operations; and wherein the cache (12) is a write-through cache (12).
[0005]
Method, comprising the steps of: a control circuit (40) monitoring activity in a write buffer (42); in response to the activity, the control circuit (40) modifying one or more emptying measures (22) maintained by the control circuit (40); and the control circuit (40) causing one or more write operations from at least one buffer entry (46A, 46B) in the write buffer (42) to be written to a next level of memory (16), in response to the one or more emptying measures (22) and additionally in response to reaching a full occupancy limit level of the write buffer (42), characterized by the fact that modifying the one or more emptying measures (22) changes a frequency at which the control circuit (40) causes the one or more write operations to be written to the next level of memory (16), and in that the one or more emptying measures (22) comprise a collapse state in each entry (46A, 46B) of the plurality of entries (46A, 46B), wherein the collapse state indicates whether or not at least one collapsed write has been detected in the combined write operations in that entry (46A, 46B), and wherein a collapsed write is a write operation that overwrites data written by a previous write operation in that entry (46A, 46B), and wherein the control circuit (40) is configured to cause the transmission of one or more combined write operations in response to the number of occupied buffer entries (46A, 46B) that are not in a collapsed state reaching the full occupancy limit level.
[0006]
Method, according to claim 5, characterized by the fact that the monitoring comprises detecting a first write operation that hits a first entry (46A) of the write buffer (42) and that updates at least one byte that is already updated in the first buffer entry (46A), and wherein the one or more emptying measures (22) comprise a state in each buffer entry (46A, 46B), and wherein the modifying comprises setting the state in the first buffer entry (46A) to indicate the detection.
[0007]
Method, according to claim 6, characterized by the fact that it further comprises comparing a number of write buffer entries (46A, 46B) that are storing write operations with the full occupancy limit level, wherein the first buffer entry (46A) is excluded from the number.

Similar technologies:

Publication number | Publication date | Patent title

BR112013003850B1|2020-12-01|device and combination writing buffer method with dynamically adjustable emptying measures.

US9223710B2|2015-12-29|Read-write partitioning of cache memory

US8966222B2|2015-02-24|Message passing in a cluster-on-chip computing environment

US6366984B1|2002-04-02|Write combining buffer that supports snoop request

US8230179B2|2012-07-24|Administering non-cacheable memory load instructions

KR101355105B1|2014-01-23|Shared virtual memory management apparatus for securing cache-coherent

US9836406B2|2017-12-05|Dynamic victim cache policy

US20160055099A1|2016-02-25|Least Recently Used Mechanism for Cache Line Eviction from a Cache Memory

KR20170130390A|2017-11-28|Memory controller for multi-level system memory with coherent units

US9087561B2|2015-07-21|Hybrid cache

US9684595B2|2017-06-20|Adaptive hierarchical cache policy in a microprocessor

US10282292B2|2019-05-07|Cluster-based migration in a multi-level memory hierarchy

US20090006777A1|2009-01-01|Apparatus for reducing cache latency while preserving cache bandwidth in a cache subsystem of a processor

US9465740B2|2016-10-11|Coherence processing with pre-kill mechanism to avoid duplicated transaction identifiers

JP2016057763A|2016-04-21|Cache device and processor

US9244841B2|2016-01-26|Merging eviction and fill buffers for cache line transactions

US9715452B2|2017-07-25|Methods to reduce memory foot-print of NUMA aware structures and data variables

US20150067246A1|2015-03-05|Coherence processing employing black box duplicate tags

EP2420933A1|2012-02-22|Combining write buffer with dynamically adjustable flush metrics

US9454482B2|2016-09-27|Duplicate tag structure employing single-port tag RAM and dual-port state RAM

US11016905B1|2021-05-25|Storage class memory access

Patent family:

Publication number | Publication date

AU2011292293B2|2014-02-06|

KR101335860B1|2013-12-02|

JP5621048B2|2014-11-05|

US8352685B2|2013-01-08|

WO2012024158A1|2012-02-23|

US20120047332A1|2012-02-23|

JP2013536526A|2013-09-19|

CN103069400B|2015-07-01|

US20130103906A1|2013-04-25|

US8566528B2|2013-10-22|

CN103069400A|2013-04-24|

MX2013001941A|2013-03-18|

KR20120018100A|2012-02-29|

BR112013003850A2|2016-07-05|

AU2011292293A1|2013-02-21|

Cited documents:

Publication number | Filing date | Publication date | Applicant | Patent title

US5664106A|1993-06-04|1997-09-02|Digital Equipment Corporation|Phase-space surface representation of server computer performance in a computer network|

US5630075A|1993-12-30|1997-05-13|Intel Corporation|Write combining buffer for sequentially addressed partial line operations originating from a single instruction|

US5561780A|1993-12-30|1996-10-01|Intel Corporation|Method and apparatus for combining uncacheable write data into cache-line-sized write buffers|

US6167473A|1997-05-23|2000-12-26|New Moon Systems, Inc.|System for detecting peripheral input activity and dynamically adjusting flushing rate of corresponding output device in response to detected activity level of the input device|

US6546462B1|1999-12-30|2003-04-08|Intel Corporation|CLFLUSH micro-architectural implementation method and system|

US7020587B1|2000-06-30|2006-03-28|Microsoft Corporation|Method and apparatus for generating and managing a language model data structure|

US6671747B1|2000-08-03|2003-12-30|Apple Computer, Inc.|System, apparatus, method, and computer program for execution-order preserving uncached write combine operation|

US6658533B1|2000-09-21|2003-12-02|Intel Corporation|Method and apparatus for write cache flush and fill mechanisms|

US7191349B2|2002-12-26|2007-03-13|Intel Corporation|Mechanism for processor power state aware distribution of lowest priority interrupt|

JP3785165B2|2003-07-07|2006-06-14|株式会社東芝|Disk array device and intra-cabinet replication method|

KR100515059B1|2003-07-22|2005-09-14|삼성전자주식회사|Multiprocessor system and method to maintain cache coherence therefor|

JP4111910B2|2003-12-26|2008-07-02|富士通株式会社|Disk cache device|

US7353301B2|2004-10-29|2008-04-01|Intel Corporation|Methodology and apparatus for implementing write combining|

US7685372B1|2005-01-13|2010-03-23|Marvell International Ltd.|Transparent level 2 cache controller|

US7444478B2|2005-11-18|2008-10-28|International Business Machines Corporation|Priority scheme for transmitting blocks of data|

US7752173B1|2005-12-16|2010-07-06|Network Appliance, Inc.|Method and apparatus for improving data processing system performance by reducing wasted disk writes|

US8250316B2|2006-06-06|2012-08-21|Seagate Technology Llc|Write caching random data and sequential data simultaneously|

US7962679B2|2007-09-28|2011-06-14|Intel Corporation|Interrupt balancing for multi-core and power|

US7730248B2|2007-12-13|2010-06-01|Texas Instruments Incorporated|Interrupt morphing and configuration, circuits, systems and processes|

US8190826B2|2008-05-28|2012-05-29|Advanced Micro Devices, Inc.|Write combining cache with pipelined synchronization|

TW201015321A|2008-09-25|2010-04-16|Panasonic Corp|Buffer memory device, memory system and data trnsfer method|

US20110112798A1|2009-11-06|2011-05-12|Alexander Branover|Controlling performance/power by frequency control of the responding node|

US8977834B2|2011-02-14|2015-03-10|Seagate Technology Llc|Dynamic storage regions|

US8943248B2|2011-03-02|2015-01-27|Texas Instruments Incorporated|Method and system for handling discarded and merged events when monitoring a system bus|

CN102736987A|2011-04-15|2012-10-17|鸿富锦精密工业（深圳）有限公司|Monitoring data caching method and monitoring data caching system|

JP2012234254A|2011-04-28|2012-11-29|Toshiba Corp|Memory system|

US20120284459A1|2011-05-05|2012-11-08|International Business Machines Corporation|Write-through-and-back cache|

JP5492156B2|2011-08-05|2014-05-14|株式会社東芝|Information processing apparatus and cache method|

CN104254841A|2012-04-27|2014-12-31|惠普发展公司，有限责任合伙企业|Shielding a memory device|

US9280479B1|2012-05-22|2016-03-08|Applied Micro Circuits Corporation|Multi-level store merging in a cache and memory hierarchy|

US9645917B2|2012-05-22|2017-05-09|Netapp, Inc.|Specializing I/O access patterns for flash storage|

US9111039B2|2012-08-29|2015-08-18|Apple Inc.|Limiting bandwidth for write transactions across networks of components in computer systems|

JP2014086116A|2012-10-25|2014-05-12|Toshiba Corp|Magnetic disk device and data write method|

JP2014182488A|2013-03-18|2014-09-29|Fujitsu Ltd|Arithmetic processing apparatus and method of controlling the same|

US9423978B2|2013-05-08|2016-08-23|Nexgen Storage, Inc.|Journal management|

US20150006820A1|2013-06-28|2015-01-01|Texas Instruments Incorporated|Dynamic management of write-miss buffer to reduce write-miss traffic|

US9262337B2|2013-10-09|2016-02-16|Microsoft Technology Licensing, Llc|Dynamically determining a translation lookaside buffer flush promotion threshold value|

US9798631B2|2014-02-04|2017-10-24|Microsoft Technology Licensing, Llc|Block storage by decoupling ordering from durability|

US10013501B2|2015-10-26|2018-07-03|Salesforce.Com, Inc.|In-memory cache for web application data|

US9858187B2|2015-10-26|2018-01-02|Salesforce.Com, Inc.|Buffering request data for in-memory cache|

US9984002B2|2015-10-26|2018-05-29|Salesforce.Com, Inc.|Visibility parameters for an in-memory cache|

US9990400B2|2015-10-26|2018-06-05|Salesforce.Com, Inc.|Builder program code for in-memory cache|

US20170249257A1|2016-02-29|2017-08-31|Itu Business Development A/S|Solid-state storage device flash translation layer|

US10310997B2|2016-09-22|2019-06-04|Advanced Micro Devices, Inc.|System and method for dynamically allocating memory to hold pending write requests|

US10915498B2|2017-03-30|2021-02-09|International Business Machines Corporation|Dynamically managing a high speed storage tier of a data storage system|

US10795575B2|2017-03-31|2020-10-06|International Business Machines Corporation|Dynamically reacting to events within a data storage system|

US10289552B2|2017-05-03|2019-05-14|Western Digital Technologies, Inc.|Storage system and method for flush optimization|

US10642745B2|2018-01-04|2020-05-05|Salesforce.Com, Inc.|Key invalidation in cache systems|

US11194722B2|2018-03-15|2021-12-07|Intel Corporation|Apparatus and method for improved cache utilization and efficiency on a many core processor|

KR20200046495A|2018-10-24|2020-05-07|에스케이하이닉스 주식회사|Memory system and operating method thereof|

KR20200060154A|2018-11-22|2020-05-29|에스케이하이닉스 주식회사|Memory controller and operating method thereof|

US20200192805A1|2018-12-18|2020-06-18|Western Digital Technologies, Inc.|Adaptive Cache Commit Delay for Write Aggregation|

US11210100B2|2019-01-08|2021-12-28|Apple Inc.|Coprocessor operation bundling|

US11256622B2|2020-05-08|2022-02-22|Apple Inc.|Dynamic adaptive drain for write combining buffer|

US10802762B1|2020-06-08|2020-10-13|Open Drives LLC|Systems and methods for asynchronous writing of synchronous write requests based on a dynamic write threshold|

Legal status:
2018-12-26| B06F| Objections, documents and/or translations needed after an examination request according [chapter 6.6 patent gazette]|

2020-08-11| B09A| Decision: intention to grant [chapter 9.1 patent gazette]|

2020-12-01| B16A| Patent or certificate of addition of invention granted [chapter 16.1 patent gazette]|Free format text: TERM OF VALIDITY: 20 (TWENTY) YEARS COUNTED FROM 11/08/2011, SUBJECT TO THE LEGAL CONDITIONS. |

Priority:

Application number | Filing date | Patent title

US12/860,505|US8352685B2|2010-08-20|2010-08-20|Combining write buffer with dynamically adjustable flush metrics|

US12/860,505|2010-08-20|

PCT/US2011/047389|WO2012024158A1|2010-08-20|2011-08-11|Combining write buffer with dynamically adjustable flush metrics|
